Rethinking Causal Mask Attention for Vision-Language Inference

Pei, Xiaohuan, Huang, Tao, Ma, YanXiang, Xu, Chang

arXiv.org Artificial Intelligence

Causal attention has become a foundational mechanism in autoregressive vision-language models (VLMs), unifying textual and visual inputs under a single generative framework. However, existing causal mask-based strategies are inherited from large language models (LLMs) where they are tailored for text-only decoding, and their adaptation to vision tokens is insufficiently addressed in the prefill stage. Strictly masking future positions for vision queries introduces overly rigid constraints, which hinder the model's ability to leverage future context that often contains essential semantic cues for accurate inference. In this work, we empirically investigate how different causal masking strategies affect vision-language inference and then propose a family of future-aware attentions tailored for this setting. We first empirically analyze the effect of previewing future tokens for vision queries and demonstrate that rigid masking undermines the model's capacity to capture useful contextual semantic representations. Based on these findings, we propose a lightweight attention family that aggregates future visual context into past representations via pooling, effectively preserving the autoregressive structure while enhancing cross-token dependencies. We evaluate a range of causal masks across diverse vision-language inference settings and show that selectively compressing future semantic context into past representations benefits the inference.
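The idea of pooling future visual context into past representations can be sketched minimally. The snippet below is an illustrative NumPy toy, not the paper's exact method: it runs standard strictly-causal attention, then, for positions flagged as vision queries, mixes in an average-pooled summary of the strictly-future value vectors (the mixing weight 0.5 and mean pooling are arbitrary assumptions).

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def future_aware_attention(q, k, v, vision_mask):
    """Causal attention where vision queries also receive a pooled summary
    of future tokens (illustrative sketch only).

    q, k, v: (T, d) arrays; vision_mask: (T,) bool, True at vision positions.
    """
    T, d = q.shape
    causal = np.tril(np.ones((T, T), dtype=bool))  # position i attends to j <= i
    scores = q @ k.T / np.sqrt(d)
    scores = np.where(causal, scores, -np.inf)
    out = softmax(scores, axis=-1) @ v
    # For vision queries, compress future context into the past by mixing in
    # an average-pooled summary of the strictly-future value vectors.
    for i in np.nonzero(vision_mask)[0]:
        if i + 1 < T:
            out[i] = 0.5 * out[i] + 0.5 * v[i + 1:].mean(axis=0)
    return out
```

Text-only positions behave exactly as under a standard causal mask, so the autoregressive structure is preserved wherever the vision mask is false.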


MCAT: Visual Query-Based Localization of Standard Anatomical Clips in Fetal Ultrasound Videos Using Multi-Tier Class-Aware Token Transformer

Mishra, Divyanshu, Saha, Pramit, Zhao, He, Hernandez-Cruz, Netzahualcoyotl, Patey, Olga, Papageorghiou, Aris, Noble, J. Alison

arXiv.org Artificial Intelligence

Accurate standard plane acquisition in fetal ultrasound (US) videos is crucial for fetal growth assessment, anomaly detection, and adherence to clinical guidelines. However, manually selecting standard frames is time-consuming and prone to intra- and inter-sonographer variability. Existing methods primarily rely on image-based approaches that capture standard frames and then classify the input frames across different anatomies, ignoring the dynamic nature of video acquisition and its interpretation. To address these challenges, we introduce the Multi-Tier Class-Aware Token Transformer (MCAT), a visual query-based video clip localization (VQ-VCL) method that assists sonographers by enabling them to capture a quick US sweep. Given a visual query of the anatomy they wish to analyze, MCAT returns the video clip containing the standard frames for that anatomy, facilitating thorough screening for potential anomalies. We evaluate MCAT on two ultrasound video datasets and a natural image VQ-VCL dataset based on Ego4D. Our model outperforms state-of-the-art methods by 10% and 13% mtIoU on the ultrasound datasets and by 5.35% mtIoU on the Ego4D dataset, using 96% fewer tokens. MCAT's efficiency and accuracy have significant potential implications for public health, especially in low- and middle-income countries (LMICs), where it may enhance prenatal care by streamlining standard plane acquisition, simplifying US-based screening and diagnosis, and allowing sonographers to examine more patients.
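The mtIoU numbers above are mean temporal IoU: the overlap between a predicted clip and the ground-truth clip divided by their union in time, averaged over queries. A minimal reference implementation:

```python
def temporal_iou(pred, gt):
    """Temporal IoU between two clips given as (start, end) times in seconds."""
    inter = max(0.0, min(pred[1], gt[1]) - max(pred[0], gt[0]))
    union = (pred[1] - pred[0]) + (gt[1] - gt[0]) - inter
    return inter / union if union > 0 else 0.0

def mean_tiou(preds, gts):
    """Mean temporal IoU over matched predicted/ground-truth clip pairs."""
    return sum(temporal_iou(p, g) for p, g in zip(preds, gts)) / len(preds)
```

For example, a predicted clip (0 s, 10 s) against ground truth (5 s, 15 s) overlaps for 5 s out of a 15 s union, giving a tIoU of 1/3.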


SketchQL Demonstration: Zero-shot Video Moment Querying with Sketches

Wu, Renzhi, Chunduri, Pramod, Shah, Dristi J, Aravind, Ashmitha Julius, Payani, Ali, Chu, Xu, Arulraj, Joy, Rong, Kexin

arXiv.org Artificial Intelligence

In this paper, we present SketchQL, a video database management system (VDBMS) for retrieving video moments with a sketch-based query interface. This novel interface allows users to specify object trajectory events with simple mouse drag-and-drop operations. Users can use trajectories of single objects as building blocks to compose complex events. Using a pre-trained model that encodes trajectory similarity, SketchQL achieves zero-shot video moment retrieval by performing similarity searches over the video to identify the clips most similar to the visual query. In this demonstration, we introduce the graphical user interface of SketchQL and detail its functionalities and interaction mechanisms. We also demonstrate the end-to-end usage of SketchQL, from query composition to video moment retrieval, using real-world scenarios.
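The retrieval step can be sketched as a sliding-window similarity search. The snippet below is an illustrative toy, not SketchQL's actual pipeline: it assumes per-frame trajectory embeddings already exist (SketchQL's encoder and matching are learned), pools each candidate window by mean, and ranks windows by cosine similarity to the query embedding.

```python
import numpy as np

def retrieve_moments(video_emb, query_emb, window, top_k=3):
    """Rank fixed-length windows of a video by cosine similarity to a
    sketch-query embedding (illustrative sketch under toy assumptions).

    video_emb: (T, d) per-frame embeddings; query_emb: (d,) query embedding.
    Returns a list of (start, end, score) for the top_k windows.
    """
    qn = query_emb / np.linalg.norm(query_emb)
    scores = []
    for s in range(len(video_emb) - window + 1):
        clip = video_emb[s:s + window].mean(axis=0)    # pool the window
        sim = float(clip @ qn / np.linalg.norm(clip))  # cosine similarity
        scores.append((s, s + window, sim))
    scores.sort(key=lambda t: -t[2])
    return scores[:top_k]
```

In practice a system would also need non-maximum suppression over overlapping windows and variable-length proposals; this sketch only shows the zero-shot ranking idea.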


Visual Diagrammatic Queries in ViziQuer: Overview and Implementation

Ovčiņņikiva, Jūlija, Šostaks, Agris, Čerāns, Kārlis

arXiv.org Artificial Intelligence

Knowledge graphs (KGs) have become an important data organization paradigm. The available textual query languages for information retrieval from KGs, such as SPARQL for RDF-structured data, do not provide means for involving non-technical experts in the data access process. Visual query formalisms, alongside form-based and natural language-based ones, offer means for easing user involvement in the data querying process. ViziQuer is a visual query notation and tool offering diagrammatic means for describing rich data queries, involving optional and negation constructs, as well as aggregation and subqueries. In this paper, we review the visual ViziQuer notation from the end-user point of view and describe the conceptual and technical solutions (including an abstract syntax model, followed by a generation model for textual queries) that map the visual diagrammatic query notation into the textual SPARQL language, thus enabling the execution of rich visual queries over actual knowledge graphs. The described solutions demonstrate the viability of the model-based approach in translating a complex visual notation into a complex textual one; they serve as a semantics-by-implementation description of the ViziQuer language and provide building blocks for further services in the ViziQuer tool context.
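The core of such a generation model is a mapping from an abstract query model to SPARQL text. The snippet below is a hypothetical, heavily simplified illustration, far smaller than ViziQuer's actual model transformations: a main node class plus selected attributes and optional filters are serialized into a basic SELECT query (the `to_sparql` name and its inputs are assumptions for this sketch).

```python
def to_sparql(main_class, attributes, filters=None):
    """Generate a SPARQL SELECT query from a toy abstract query model
    (hypothetical simplification of a visual-to-textual generation model).

    main_class: class IRI of the query's main node, e.g. ':Student'.
    attributes: property IRIs to select; filters: optional FILTER expressions.
    """
    vars_ = [f"?{a.split(':')[-1]}" for a in attributes]
    lines = [f"SELECT {' '.join(vars_)} WHERE {{",
             f"  ?x a {main_class} ."]
    for a, v in zip(attributes, vars_):
        lines.append(f"  ?x {a} {v} .")
    for f in (filters or []):
        lines.append(f"  FILTER ({f})")
    lines.append("}")
    return "\n".join(lines)
```

For instance, `to_sparql(":Student", [":name", ":age"], ["?age > 20"])` yields a query selecting names and ages of students over 20; the real notation additionally covers optional patterns, negation, aggregation, and subqueries.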


Natural Language Aided Visual Query Building for Complex Data Access

Pan, Shimei (IBM Watson Research Center) | Zhou, Michelle (IBM Almaden Research Center) | Houck, Keith (IBM Watson Research Center) | Kissa, Peter (IBM Watson Research Center)

AAAI Conferences

Over the past decades, there have been significant efforts to develop robust and easy-to-use query interfaces to databases. So far, the typical query interfaces have been GUI-based visual query interfaces. Visual query interfaces, however, have limitations, especially when they are used to access large and complex datasets. Therefore, we are developing a novel query interface in which users can use natural language expressions to help author visual queries. Our work enhances the usability of a visual query interface by directly addressing the "knowledge gap" issue in visual query interfaces. We have applied our work in several real-world applications, and our preliminary evaluation demonstrates the effectiveness of our approach.